Design of CKIP Chinese Word Segmentation System
Abstract
In this paper, we describe the design of the CKIP Chinese word segmentation system and analyse its performance. The system takes a modular approach: independent modules were designed to resolve segmentation ambiguities and to identify unknown words. Segmentation ambiguities are resolved by a hybrid method that combines heuristic and statistical rules. Regular-type unknown words are identified by regular expressions; irregular types of unknown words are first detected by their occurrence and then extracted by morphological rules under statistical and morphological constraints. At the First International Chinese Word Segmentation Bakeoff, the CKIP system was tested on the open and closed tracks of Beijing University (PK) and Hong Kong CityU (HK). The evaluation results show that our system performed very well on both the HK open and closed tracks, and acceptably on the PK tracks.
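The idea of handling regular-type unknown words with regular expressions ahead of dictionary lookup can be illustrated with a minimal sketch. This is not the CKIP implementation: the lexicon, the single date/number pattern, and the function name `segment` are all hypothetical, and plain forward maximum matching stands in for the hybrid heuristic and statistical ambiguity resolution described in the abstract.

```python
import re

# Hypothetical toy lexicon; the real CKIP dictionary is not reproduced here.
LEXICON = {"我們", "在", "台北", "開會"}
MAX_WORD_LEN = max(len(w) for w in LEXICON)

# Assumed pattern for one class of "regular-type" unknown words:
# digit strings optionally followed by a date unit (年, 月, 日).
REGULAR_TYPE = re.compile(r"\d+(?:年|月|日)?")


def segment(text):
    """Forward maximum matching, with regular-type matches taken first."""
    words, i = [], 0
    while i < len(text):
        # Step 1: try the regular-expression pattern at the current position.
        m = REGULAR_TYPE.match(text, i)
        if m:
            words.append(m.group())
            i = m.end()
            continue
        # Step 2: try the longest lexicon entry starting at position i.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            if text[i:i + length] in LEXICON:
                words.append(text[i:i + length])
                i += length
                break
        else:
            # Step 3: nothing matched, so emit a single character.
            words.append(text[i])
            i += 1
    return words
```

With this toy setup, `segment("我們在2004年3月開會")` returns `["我們", "在", "2004年", "3月", "開會"]`. Characters covered by neither the pattern nor the lexicon fall through as single characters, which is roughly where a module for irregular unknown words would take over.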
Related resources
Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff
In this paper, we briefly describe the procedures of our segmentation system, including the methods for resolving segmentation ambiguities and identifying unknown words. The CKIP group of Academia Sinica participated in testing on open and closed tracks of Beijing University (PK) and Hong Kong CityU (HK). The evaluation results show our system performs very well in either HK open track or HK c...
An Improved Chinese Word Segmentation System with Conditional Random Field
In this paper, we describe a Chinese word segmentation system that we developed for the Third SIGHAN Chinese Language Processing Bakeoff (Bakeoff2006). We took part in six tracks, namely the closed and open track on three corpora, Academia Sinica (CKIP), City University of Hong Kong (CityU), and University of Pennsylvania/University of Colorado (UPUC). Based on a conditional random field based ...
Introduction to CKIP Chinese Spelling Check System for SIGHAN Bakeoff 2013 Evaluation
In order to accomplish the tasks of identifying incorrect characters and error correction, we developed two error detection systems with different dictionaries. The first system, called CKIP-WS, adopted the CKIP word segmentation system, which is based on the CKIP dictionary, as its core detection procedure; the other system, called G1-WS, used Google 1T uni-gram data to extract pairs of potential error word ...
Maximum Entropy Word Segmentation of Chinese Text
We extended the work of Low, Ng, and Guo (2005) to create a Chinese word segmentation system based upon a maximum entropy statistical model. This system was entered into the Third International Chinese Language Processing Bakeoff and evaluated on all four corpora in their respective open tracks. Our system achieved the highest F-score for the UPUC corpus, and the second, third, and seventh high...
Nanjing Normal University Segmenter for the Fourth SIGHAN Bakeoff
This paper expounds a Chinese word segmentation system built for the Fourth SIGHAN Bakeoff. The system participates in six tracks, namely the CityU Closed, CKIP Closed, CTB Closed, CTB Open, SXU Closed and SXU Open tracks. The model of Conditional Random Field is used as a basic approach in the system, with attention focused on the construction of feature templates and Chinese character categor...
Journal title:
- Journal of Chinese Language and Computing
Volume: 14
Publication date: 2004